Easy access to the internet throughout the world has transformed the use
of social communication platforms such as Facebook, Twitter, and LinkedIn, which are
becoming a part of our lives. Accordingly, cybercrime has become a serious
problem, especially in developing countries. The ability to disseminate
information with little risk of being discovered leads to an increase
in cybercriminal activity. Meanwhile, the huge amount of data continuously
produced on Twitter makes discovering cybercriminals a
tough task. This research contributes a method that builds
comparable vectors for the positive and negative classes and then
classifies incoming tweets to predict their class (positive or negative). The
proposed routine starts by constructing super comparable vectors (SCV),
one positive and one negative, then constructs a vector for the incoming
tweet, calculates its similarity to both SCVs, and compares the calculated
similarities to predict the class of the incoming tweet. In this research, we used
some common techniques for calculating the weight of terms in tweets to
construct the SCVs. To verify the operation of the proposed system,
we performed a pilot analysis on a real example. The proposed approach
achieves precision, recall, and F1 values of 87%, 59%, and 69.99%,
respectively.


TF inverse document frequency 
1. INTRODUCTION

The aim of cybersecurity is to protect the processes, hardware, software, and information components of
computer systems during online sessions from theft, harm, interruption, and illegal use. In the present era,
social networks are part and parcel of everyday life. Easy access to the internet throughout the world has
transformed the use of social communication platforms such as Facebook, Twitter, and LinkedIn, which are
becoming a part of our lives. Accordingly, cybercrime has become a serious problem, especially in developing
countries. One concern related to these cybercrimes is cybersecurity information, which is found in a variety of
unofficial sources, such as social media platforms, chat rooms, blogs, and developer forums. These sources
provide information about security vulnerabilities, threats, and attacks. In order to secure this information, we
need real-time intelligence on cybersecurity threats and vulnerabilities. The proposed approach also considers a
number of studies that have suggested models for future cybersecurity threats based on time series
and moving averages.

Over 150 million users wrote over 500 million tweets per day in 2019 [1].
Text categorization is therefore necessary to get relevant information from such a huge collection of tweets and to
convert the large-scale information into small subcategories that are manageable for analysis. As a result,
attempts have been made to classify tweet text into several areas to discover information on a
particular topic. A typical example is user geolocation prediction, which categorizes tweets by geo-political
location [2].

Twitter also has 330 million monthly users worldwide [3]. Turkey ranks fifth among countries, with
approximately 9 million active users as of January 2019 [4]. Other examples of tweet classification are predicting
political affiliation by rating tweets related to politics [5] and predicting crime by categorizing Twitter posts
based on emotion [6]. However, rating Twitter posts is inherently difficult because the length of a post is
limited to 280 characters and varied types of users engage in informal tweet writing [7].

Dionísio et al. [8] present a new methodology to improve search results. The problem they address is that
cyberattacks are a common and major issue because of the many flows of data produced on social media such as
Twitter, so most organizations resort to security information systems. These systems rely on synchronizing with
the latest corrections, as well as with the threats presented by threat feeds. They propose a new architecture
divided into three stages. The first stage collects data from Twitter using the Twitter application programming
interface (API), applies filters, and normalizes the tweets to a specific format. The second stage applies a binary
classifier that labels the fetched tweets as relevant or irrelevant: relevant tweets may contain valuable
information about an asset, and irrelevant tweets do not. Finally, the named entity recognition (NER) network
processes the relevant tweets during the information extraction stage. For example, the information gathered
could be used to send out a security alert. The proposed pipeline achieves an average true positive rate of 94%
and an average true negative rate of 91% for the classification task, and an average F1-score of 92% for the
named entity recognition task, across three case-study infrastructures.

Sabottke et al. [9] present a vulnerability detection service based on Twitter that uses
support vector machine (SVM) classification to assess whether a vulnerability could be exploited in a real-life
situation. A fascinating feature of this research is that it looks into adversarial interference as a way to trick the
classifier. These recruiters may provide the information in a more organized and formal way at their Twitter
headquarters. It takes advantage of some of the characteristics of these essays, such as their grammatical
relationships. In conjunction with this proposal, Zhou et al. [10] use the architecture of the
NER to extract indicators of compromise (IoC) from cybersecurity reports, in contrast to the design.
Like the one bearing her name. Badjatiya et al. [11] educate other customers’ opinions of copyright on 
SemEval 2015 [12] Twitter. The outdoor equipment introduced to the challenge ranked first in both missions. 

Badjatiya et al. [11] presents an example which is not assertive as, but is not limited to, the path of 
learning for the advice of the fossil idiot who loves you to hate. The authors report that learning techniques 
play an important role in eating. Regarding the applications of learning applications, applications, 
applications, applications, and applications. Wagner et al. [13] long short-term memory (LSTM) architecture 
implemented sequencing to suspend medical entities for public health monitoring. The architecture offered is 
superior to the previous case. As far as the future is concerned, in relation to the infrastructure proposed by 
Lample et al. [14]. This is how we adopted our NER approach [15]. 

Aslan et al. [16] work with Twitter as an example and use machine learning techniques to
investigate whether social media accounts are important in terms of cybersecurity. They used a crawler written in
the Python programming language with the Twitter API to build the dataset for their research. Alves et al. [17]
carried out a quantitative evaluation that considers all Twitter posts from 80
accounts over 8 months (a total of 195,000 tweets). It shows that their methodology is timely
and successfully finds most of the security-related tweets concerning an example IT infrastructure (true
positive rate greater than 90%), while incorrectly selecting only a small number of tweets (false
positive rate less than 10%). Duarte et al. [18] introduce a new methodology developed for text
analysis in English or other widely spoken languages, such as Portuguese, using big data [19]. Javed et al. [20]
introduces what because of this type of shortcut, we can block such harmful websites by clicking on them as
locker websites. Khandpur et al. [21] present a new study on detecting cyber-attacks by analysing Twitter
data. Sohime et al. [22] show that if the investigation period is long enough, cyber attackers have an advantage,
which makes it difficult for a security analyst to keep up with the latest risks.

In this research we aim to produce a mixed methodology to overcome the above problems, which we can
summarize as follows:

— Collect a binary classification dataset.

— Collect the full description of each tweet from the Twitter application programming interface (API) and
separate the dataset into positive and negative pockets.

— Construct a super comparable vector (SCV) for each pocket to generate a positive vector and a negative
vector.

— Construct a vector from the entire tweet.

— Calculate the similarities between the tweet vector and both the positive and negative vectors.

— Compare the results.

The rest of this study is organized as follows: section 2 presents the methodology and the proposed
algorithm, including how to enhance binary classification using the Word2Vec model. Section 3 presents the
experiment and shows the improvement obtained after applying our proposed algorithm. The conclusion and
future work are presented in section 4.


2. METHODOLOGY 

Our methodology focuses on how to improve the accuracy of the classification method.
Therefore, it uses the Twitter social network and a binary classified dataset, which will be described in later
sections. We walk through the proposed algorithm, which this research implements to answer the suggested
questions. We also discuss the techniques that this research implements.


2.1. Classification algorithm 

This section tries to overcome the following challenges, which are the main part of this
methodology:
— How to enhance binary classification using the Word2Vec model?
— How to construct the super comparable vector (SCV)?
Figure 1 shows the stages of the proposed algorithm. First, the algorithm collects metadata (full tweet text and
tags) about the tweets in the working dataset [23] by using the Twitter API service (provided by Twitter itself).


[Figure 1: block diagram. Labels: classified dataset; collect data from Twitter API [1]; separation layer [2],
with the positive dataset and the negative dataset; vectorize layer [3], with vector Vi (from tweet ti), vector Vp,
and vector Vn; classification layer [4]; class of tweet ti [5].]

Figure 1. Algorithm stages


The working dataset consists of three classified datasets D1, D2, and D3. Each raw dataset contains only the
tweet-id and the classification status, so we must collect the other data from the Twitter API. The next step is to
separate each dataset into two pockets, one for positive tweets and the other for negative tweets. For each
pocket (positive and negative) we then construct a super comparable vector; we use several different
methodologies to construct it, which are described later. We are then ready to compare each dataset with the two
super comparable vectors (positive and negative) by using the following vectors produced from the pockets:

Vi for tweet i, Vp for the super positive vector, and Vn for the super negative vector.

As shown in the vectorize layer, these vectors prepare the next step: by calculating the similarity between them
we obtain a new class for each tweet in Di. We compare the new class with the actual class to calculate the
efficiency of each methodology. We show our architecture in detail in the following sections, but for now we
can summarize it as the following algorithm:
— Collect a binary classified dataset.

— Collect the full description of each tweet from the Twitter API.

— Separate the dataset into positive and negative pockets.

— Construct a super comparable vector (SCV) for each pocket to generate Vp (for the positive pocket) and Vn (for
the negative pocket).

— Construct the vector Vi for tweet ti.

— Calculate Spi and Sni (Spi is the similarity between Vi and Vp, and Sni is the similarity between Vi and Vn).

— Compare Spi and Sni to predict the class of ti.
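
The steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the whitespace tokenizer, the function names (`build_scv`, `classify`), the toy tweets, and the rule that the smaller Euclidean distance marks the more similar class are all assumptions, and plain TF weighting stands in for whichever weighting technique is chosen.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase whitespace tokenization; the paper does not specify one.
    return text.lower().split()


def tf_vector(tokens):
    """Term-frequency weights: count of each term / total number of terms."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {term: count / total for term, count in counts.items()} if total else {}


def build_scv(tweets):
    """Merge all tweets of one pocket and weight the merged terms."""
    merged = [tok for tweet in tweets for tok in tokenize(tweet)]
    return tf_vector(merged)


def euclidean(u, v):
    """Euclidean distance between two sparse vectors stored as dicts."""
    terms = set(u) | set(v)
    return math.sqrt(sum((u.get(t, 0.0) - v.get(t, 0.0)) ** 2 for t in terms))


def classify(tweet, vp, vn):
    """Return 'positive' when the tweet vector is closer to Vp than to Vn."""
    vi = tf_vector(tokenize(tweet))
    spi, sni = euclidean(vi, vp), euclidean(vi, vn)
    # Assumption: a smaller distance means a higher similarity; ties go to negative.
    return "positive" if spi < sni else "negative"


# Toy pockets (hypothetical data, not taken from the paper's datasets).
positive_pocket = ["new exploit found in server", "critical exploit patch released"]
negative_pocket = ["lovely weather today", "coffee with friends today"]
vp, vn = build_scv(positive_pocket), build_scv(negative_pocket)
print(classify("exploit in mail server", vp, vn))
```

Storing the vectors as dictionaries keeps them sparse, so the distance is computed only over terms that occur in at least one of the two vectors.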


2.2. Classification method 
In our proposal we use the following techniques in the vectorize and classification layers:
— Term frequency technique (TF).
— Term frequency inverse document frequency technique (TF-IDF).
— Dominant meaning technique.
We add an extra step to each of the above techniques: removing the duplicate dimensions from both the positive
and the negative super comparable vectors. This addition changes the results significantly.
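
One possible reading of this extra reduction step is sketched below in Python. It is an assumption rather than the authors' exact procedure: "duplicate dimension" is interpreted here as any term that occurs in both super comparable vectors, and such terms are dropped from both. The function name and the toy weights are hypothetical.

```python
def remove_duplicate_dimensions(vp, vn):
    """Drop every term that appears in both vectors (assumed meaning of 'duplicate')."""
    shared = set(vp) & set(vn)
    vp_reduced = {t: w for t, w in vp.items() if t not in shared}
    vn_reduced = {t: w for t, w in vn.items() if t not in shared}
    return vp_reduced, vn_reduced


# Toy vectors (hypothetical weights): "today" occurs in both pockets.
vp = {"exploit": 0.4, "server": 0.3, "today": 0.3}
vn = {"coffee": 0.5, "today": 0.5}
vp_reduced, vn_reduced = remove_duplicate_dimensions(vp, vn)
print(vp_reduced)  # {'exploit': 0.4, 'server': 0.3}
print(vn_reduced)  # {'coffee': 0.5}
```

Under this reading, only terms that are exclusive to one class remain, which is consistent with a classifier that becomes stricter about what it labels positive.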

The super comparable vectors for the positive and negative pockets are constructed by merging all tweets
inside each pocket and calculating the weight of each term with one of the techniques described
later. This process produces two vectors, the positive vector Vp and the negative vector Vn. We use the same
technique to construct the vector Vi for tweet ti in preparation for the next step. We then calculate Spi and Sni,
the similarities between ti and Vp and Vn respectively, by using the Euclidean distance equation as follows:


$$S_{pi} = \mathrm{similarity}(t_i, V_p), \qquad S_{ni} = \mathrm{similarity}(t_i, V_n),$$

where $i = 1, 2, 3, \ldots, k$ and $k$ is the dataset size.

$$\mathrm{similarity}(x, y) = \sqrt{\sum_{j} (x_j - y_j)^2}$$


Depending on Spi and Sni, we obtain the class of ti, which is used to calculate the efficiency. We now briefly
describe the techniques used in this paper:

— Term frequency (TF) technique: calculates the ratio of the number of times a term occurs to the total number
of terms in the document.

$$\mathrm{TF}(t) = \frac{\text{number of times term } t \text{ appears in a tweet}}{\text{total number of terms in the tweet}}$$

— Term frequency inverse document frequency (TF-IDF) technique: like TF, but it cancels all terms
that occur in all documents.

$$\mathrm{TF\text{-}IDF}(t) = \mathrm{TF}(t) \cdot \log\left(\frac{\text{total number of tweets}}{\text{number of tweets in which term } t \text{ occurs}}\right)$$

— Dominant meaning method, which is defined as “the set of keywords that fit the intended meaning of
the target word” [24]. The question is viewed as a target meaning, together with certain words that fall within
that meaning’s scope. It fixes the intended meaning, known as the keyword, then adds or removes slave
words that explain the meaning [25], [26]. To construct the dominant meaning hierarchy, assume we
have a dataset made up of n concepts, C = {Ci}, i = 1, 2, 3, ..., n. Each Ci in C is represented by a collection of
documents that try to describe concept Ci. Suppose that the collection consists of mi documents, that
is Ci = {Dj, j = 1, 2, 3, ..., mi}, and that each document Dj consists of a set of kj words, Dj = {wlj, l
= 1, 2, 3, ..., kj}. The value wlj represents the frequency with which word wl, one of the slave words
of concept Ci, occurs in document Dj. This frequency is calculated as the number of times that wl occurs in Dj.
The following steps represent the process of choosing the top N words that can indicate the dominant meaning of
concept Ci; suppose that word wic symbolizes concept Ci.

a) Compute wlj for all l, j.

b) Suppose that Cij is the frequency of concept Ci in document Dj, where j = 1, 2, 3, ..., mi.

c) Calculate the maximum of Cij over all j: $F_{ic} = \max_{j = 1, \ldots, m_i} \{C_{ij}\}$.

d) Calculate the maximum value of wlj over all l, j: $F_{iw} = \max_{l, j} \{w_{lj}\}$.

e) Choose $F_{iw}$ such that $0 \le F_{iw} \le F_{ic}$.

f) Finally, consider the dominant meaning probability:

$$P_{il} = P(w_l \mid C_i) = \frac{1}{m_i} \sum_{j=1}^{m_i} \frac{w_{lj}}{C_{ij}}, \qquad i = 1, 2, \ldots, n \ \text{ and } \ l = 1, 2, \ldots, k_j$$


Now, for each concept Ci, we rank the terms of the collection {Pi1, Pi2, ..., Pim} in decreasing order. As a
result, the dominant meaning of the concept Ci can be represented by the set of words that corresponds to
the set {Pi1, Pi2, ..., PiN}; that is:


Wi = {wi1, wi2, ..., wiN}


The set Wi represents the intended meaning of concept Ci.
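
As a rough illustration of the first two weighting techniques, the following Python sketch computes TF and TF-IDF weights for a tokenized tweet. It assumes the standard inverse document frequency (total number of tweets divided by the number of tweets containing the term), a natural logarithm, and pre-tokenized toy tweets; the function names are hypothetical. The dominant meaning technique is not covered by this sketch.

```python
import math
from collections import Counter


def tf(tokens):
    """TF(t) = occurrences of t in the tweet / total number of terms in the tweet."""
    counts = Counter(tokens)
    return {t: c / len(tokens) for t, c in counts.items()}


def tf_idf(tokens, corpus):
    """TF-IDF(t) = TF(t) * log(total tweets / tweets containing t)."""
    n_tweets = len(corpus)
    weights = {}
    for term, weight in tf(tokens).items():
        # Assumption: the tweet itself belongs to the corpus, so df >= 1.
        df = sum(1 for doc in corpus if term in doc)
        weights[term] = weight * math.log(n_tweets / df)
    return weights


# Toy corpus of tokenized tweets (hypothetical data).
corpus = [
    ["exploit", "found", "today"],
    ["patch", "released", "today"],
    ["coffee", "today"],
]
print(tf(corpus[0]))
print(tf_idf(corpus[0], corpus))
```

In this toy corpus the term "today" occurs in every tweet, so its TF-IDF weight is zero, which matches the description of TF-IDF as cancelling terms that occur in all documents.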


3. RESULTS AND DISCUSSION 

To answer the questions of how to enhance binary classification using the Word2Vec model and how
to construct the comparable vector, we conduct an experiment on a collection of binary classified tweets. We
describe in detail the dataset used to calculate efficiency. We also discuss in detail how the suggested
questions can be answered, and in the last section we try to answer the question of which technique we
recommend as giving the most accurate results.


3.1. Dataset 

The data consist of three binary classified datasets (D1, D2, D3) [23], which contain
4614, 2127, and 1081 tweets, respectively. Each Di is stored as a comma-separated values (CSV) file with
basic data (tweet-id, classification class). Table 1 describes these training datasets. We use each Di,
where i = 1, 2, 3, to construct the super comparable vectors and compare them with all the other datasets,
excluding Di itself; we describe the results in the next section.
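
The cross-dataset protocol described above can be sketched in Python as follows. This is an illustrative outline only: the helper names are hypothetical, the label strings are assumptions, and the toy labels are not taken from D1, D2, or D3.

```python
def cross_dataset_pairs(names):
    """List (source, target) pairs: vectors built on source are compared with every other dataset."""
    return [(src, tgt) for src in names for tgt in names if src != tgt]


def precision_recall(actual, predicted, positive="positive"):
    """Precision and recall of the positive class from two parallel label lists."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == positive and p == positive)
    fp = sum(1 for a, p in pairs if a != positive and p == positive)
    fn = sum(1 for a, p in pairs if a == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


print(cross_dataset_pairs(["D1", "D2", "D3"]))

# Toy labels (hypothetical): one true positive, one false positive, one false negative.
actual = ["positive", "positive", "negative", "negative"]
predicted = ["positive", "negative", "positive", "negative"]
print(precision_recall(actual, predicted))  # (0.5, 0.5)
```

With three datasets this yields six source and target pairs, one for each combination in which a dataset is not compared with itself.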


Table 1. Number of tweets for both pockets
Dataset     D1      D2      D3
Positive    2391    634     453
Negative    2223    1493    628
Total       4614    2127    1081


3.2. Experimental results

We partition this section into time performance and the efficiency of the proposed system, and discuss the
results of both. In this research, the traditional method for calculating the weight of a term in a
specific context is the term frequency (TF) technique. As the results show, using the reduction
technique as an extra step has a significant effect on the results.

— TF vs. TF + reduction: Table 2 and Table 3 show the precision and recall results for all techniques. The
first technique, pure TF (without any extra steps), gives 80% precision and 91% recall, while TF +
reduction (TF with the duplicate dimensions removed from both super comparable vectors) gives 87%
precision and 60% recall. As shown in Table 4 and Table 5, the difference between TF and TF + reduction
is about 7 percentage points in precision, which increases, and about 31 percentage points in recall, which
decreases. In addition, there is an improvement in time performance, as shown in Figure 2: TF + reduction
is more time-efficient than TF.

— Dominant meaning (DM) vs. DM + reduction: Table 2 and Table 3 show the precision and recall results
for all techniques. The first technique, pure DM, gives 80% precision and 91% recall, while DM +
reduction gives 87% precision and 60% recall. As shown in Table 4 and Table 5, the difference between
DM and DM + reduction is about 7 percentage points in precision, which increases, and about 31 percentage
points in recall, which decreases. In addition, there is an improvement in time performance, as shown in
Figure 3: DM + reduction is more time-efficient than DM.

— TF-IDF vs. TF-IDF + reduction: Table 2 and Table 3 show the precision and recall results for all
techniques. The first technique, pure TF-IDF (without any extra steps), gives 83% precision and 91%
recall, while TF-IDF + reduction gives 87% precision and 59% recall. As shown in Table 4 and Table 5,
the difference between TF-IDF and TF-IDF + reduction is about 4 percentage points in precision, which
increases, and about 32 percentage points in recall, which decreases. In addition, there is an improvement in
time performance, as shown in Figure 4: TF-IDF + reduction is more time-efficient than TF-IDF.


Table 2. Precision, recall, and F1 summary for all techniques
Technique           Precision   Recall    F1
TF                  80.00%      91.00%    82.23%
TF Reduction        87.00%      60.00%    70.34%
DM                  80.00%      91.00%    82.23%
DM Reduction        87.00%      60.00%    70.34%
TF-IDF              83.00%      91.00%    86.36%
TF-IDF Reduction    87.00%      59.00%    69.99%
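
The reported F1 values are consistent with the usual definition of F1 as the harmonic mean of precision and recall, applied to individual precision and recall pairs from the per-dataset results. The short Python check below assumes that definition; the function name is hypothetical.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (both given as fractions in [0, 1])."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall pairs taken from the per-dataset results.
for p, r in [(0.75, 0.91), (0.85, 0.60), (0.86, 0.59)]:
    print(f"P={p:.2%} R={r:.2%} F1={f1_score(p, r):.2%}")
```

For these three pairs the sketch gives F1 values of 82.23%, 70.34%, and 69.99%, matching the figures reported for TF, TF Reduction, and TF-IDF Reduction.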


Table 3. Precision, recall, and F1 summary for all techniques
                           D1                           D2                           D3
Technique         Dataset  Precision  Recall   F1       Precision  Recall   F1       Precision  Recall   F1
TF                D1       -          -        -        64.00%     76.00%   69.49%   76.00%     88.00%   81.56%
                  D2       75.00%     91.00%   82.23%   -          -        -        74.00%     91.00%   81.62%
                  D3       80.00%     81.00%   80.50%   68.00%     72.00%   69.94%   -          -        -
TF Reduction      D1       -          -        -        76.00%     50.00%   60.32%   80.00%     49.00%   60.78%
                  D2       87.00%     39.00%   53.86%   -          -        -        82.00%     36.00%   50.03%
                  D3       85.00%     60.00%   70.34%   76.00%     53.00%   62.45%   -          -        -
DM                D1       -          -        -        64.00%     76.00%   69.49%   76.00%     88.00%   81.56%
                  D2       75.00%     91.00%   82.23%   -          -        -        74.00%     91.00%   81.62%
                  D3       80.00%     81.00%   80.50%   68.00%     72.00%   69.94%   -          -        -
DM Reduction      D1       -          -        -        76.00%     50.00%   60.32%   80.00%     49.00%   60.78%
                  D2       87.00%     39.00%   53.86%   -          -        -        82.00%     36.00%   50.03%
                  D3       85.00%     60.00%   70.34%   76.00%     53.00%   62.45%   -          -        -
TF-IDF            D1       -          -        -        72.00%     81.00%   76.24%   83.00%     90.00%   86.36%
                  D2       78.00%     91.00%   84.00%   -          -        -        77.00%     91.00%   83.42%
                  D3       81.00%     83.00%   81.99%   70.00%     73.00%   71.47%   -          -        -
TF-IDF Reduction  D1       -          -        -        76.00%     50.00%   60.32%   80.00%     49.00%   60.78%
                  D2       87.00%     39.00%   53.86%   -          -        -        82.00%     35.00%   49.06%
                  D3       86.00%     59.00%   69.99%   78.00%     52.00%   62.40%   -          -        -


Table 4. The improvement values in precision
Technique           Precision   Improvement
TF                  80.00%      7.00%
TF Reduction        87.00%
DM                  80.00%      7.00%
DM Reduction        87.00%
TF-IDF              83.00%      4.00%
TF-IDF Reduction    87.00%

Table 5. The improvement values in recall
Technique           Recall      Improvement
TF                  91.00%      31.00%
TF Reduction        60.00%
DM                  91.00%      31.00%
DM Reduction        60.00%
TF-IDF              91.00%      32.00%
TF-IDF Reduction    59.00%

Figure 2. TF 1 and 2 time comparison chart

As shown in Figure 5 and Table 3, the improvement in time increases quickly for the TF and TF +
reduction techniques. Pure DM and pure TF-IDF are more time-efficient than pure TF, but TF + reduction is
close to the other techniques. It is interesting to note that TF-IDF shows the highest improvement in time
performance in both its pure and reduction forms. Finally, the above discussion and measured values show that
our proposed methodology achieves a precision of 87% and a recall of 59%. In the future we will work to
enhance these values by applying or mixing in other techniques.

4. CONCLUSION

The change in degradation for each year over the past four decades has gradually increased. In this
research we used some common techniques for calculating the weight of terms in a tweet to construct
comparable vectors, and computed the similarity between these vectors and the input tweet to predict its class
(positive or negative). The experimental results show that the best improvement in time is obtained with
dominant meaning (DM) and term frequency inverse document frequency (TF-IDF), more than with term
frequency (TF).